Background / Context:

Description of prior research and its intellectual context. 

A large and growing literature demonstrates that teacher effects on academic achievement are 
substantial in size (Clotfelter, Ladd, and Vigdor, 2006; Rivkin, Hanushek, and Kain, 2005; 
Rowan, Correnti, and Miller, 2002). Moreover, this research finds that little of the variation in 
teacher effectiveness can be explained by observable characteristics such as certification, 
education, and experience (Aaronson, Barrow, and Sander, 2007; Hanushek, Kain, O’Brien, and 
Rivkin, 2005; Kane, Rockoff, and Staiger, 2008; Rockoff 2004). Motivated by these findings, 
policymakers have sought to require that teachers' evaluation, pay, and tenure be tied directly or 
indirectly to measures of their “value-added” to achievement on standardized tests. 

Mounting evidence on school responses to test-based accountability, however, suggests that 
school behaviors can diminish the validity of test score gains associated with such systems. 
Following earlier analyses of trends on “high-stakes” tests (Klein et al., 2000; Koretz et al., 
1991; Koretz and Barron, 1998; Linn, 2000), recent studies have concluded that the gains on 
these tests significantly outpace those on national benchmark tests such as the NAEP, with the 
gains in some cases almost four times as large (Center on Education Policy, 2008; Fuller et al., 
2007; Koretz and Barron, 1998; Klein et al., 2000; Jacob, 2007). A potential explanation for 
these differences can be found in studies of strategic responses to test-based accountability. 

These studies address a wide range of activities that inflate perceptions of achievement gains, 
including the re-classification of students as requiring special education (Figlio and Getzler, 
2002; Jacob, 2005), strategic exemption of students from testing (Cullen and Reback, 2006; 
Jacob, 2005; Jennings and Beveridge, 2009), re-allocation of resources toward students on the 
margin of passing (Booher-Jennings, 2005; Reback, 2008; Neal and Schanzenbach, 2007), 
suspension of low-scoring students near the test date (Figlio, 2005), and “teaching to the test” 
(Jacob 2005, 2007). 

Perhaps surprisingly, these two bodies of research have remained largely separate. To date, 
research on teacher productivity has not investigated the potential impact of accountability 
systems on the validity of “high-stakes” test score gains as a primary measure of teacher 
effectiveness. Yet it is plausible that teachers who appear effective on these tests may not be 
deemed similarly effective on a second, low-stakes test of the same subject, particularly when 
that test covers a broader and less predictable range of skills. 

In this paper, we use data from the Houston Independent School District to estimate teacher 
effects on two different academic tests of the same subject areas, administered in the same school 
year to the same students at approximately the same time of year. The first is the statewide 
“high-stakes” test administered as part of the Texas accountability system, while the second is a 
nationally-normed “low-stakes” test intended as both an audit test and as a grade promotion tool. 

Building on past work, we estimate the size of teacher effects on each of these tests and examine 
how these effects relate to each other and how their persistence over time differs. 

Only three studies that we are aware of have considered how teacher effects vary across multiple 
achievement measures. In a paper similar to ours, Papay (forthcoming) estimated teacher effects 
in a large urban district on three different reading tests: the state accountability test, the Stanford 
Achievement Test, and SRI. He found weak to modest correlations in teacher effects across 
tests, ranging from 0.15 to 0.58, even when the tests were administered to the same sets of 
students. Papay carefully explored several hypotheses for these differences across tests and 
concluded that test timing played an important role (we discuss such mechanisms in the next 
section). Similarly, Lockwood et al. (2007) estimated teacher effects separately for the subscales 
of the Stanford math test. They found that the choice of subscale has a large effect on a 
teacher's estimated effectiveness. Finally, in a paper not directly concerned with 
teacher effects on different tests, Sass (2008) found a correlation of 0.48 between teacher effects 
estimated on the high- and low-stakes tests in Florida. Importantly, none of the papers cited here 
considered how the incentive effects of test-based accountability impact value-added measures 
of effectiveness. 

Purpose / Objective / Research Question / Focus of Study: 

Description of the focus of the research. 

We use data from the Houston Independent School District to estimate teacher effects on two 
different academic tests of the same subject areas, administered in the same school year to the 
same students at approximately the same time of year. The first is the statewide “high-stakes” 
test administered as part of the Texas accountability system, while the second is a 
nationally-normed “low-stakes” test, intended as both an audit test and as a grade promotion tool. For 
reasons explained below, we focus on achievement in reading and math in the 4th and 5th grade. 

Given these two effectiveness measures, we address the following questions: (1) Do these 
estimates of teacher effectiveness suggest a similar level of variation in quality across teachers? 
(2) How strongly are these two measures correlated? Is it the case that teachers who appear 
effective on a “high-stakes” state test are similarly effective on a “low-stakes” test of the same 
subject? (3) Is one measure of teacher effectiveness more stable from year to year than the other? 
(4) Are there differences in decay rates in teacher effects on high- and low-stakes tests? (5) To 
what extent does the high- and low-stakes nature of the test contribute to these differences? 

Setting: 

Description of the research location. 

For this paper we drew from a longitudinal dataset of all students tested in the Houston 
Independent School District (HISD) between 1998 and 2006, approximately 165,000 per year. 
HISD is the seventh largest school district in the country and the largest in the state of Texas. 
Fifty-nine percent of its students are Hispanic, while 29% are African-American, 8% are 
Caucasian, and 3% are Asian-American. Close to 80 percent of students are considered by the 
state to be economically disadvantaged, 27% are classified as Limited English Proficient, and 
11% are classified as receiving special education. 

HISD is an ideal setting for this study because it administers multiple assessments each year to 
most students: the mandatory Texas state assessments (TAAS or TAKS) and the Stanford 
Achievement Test. The TAKS is administered to students in grades 3 to 11 in reading/ELA, 
mathematics, writing, science, and social studies, though reading and math are the only subjects 
tested every year between grades 3 and 8. The Stanford Achievement Test is administered to all 
students in grades 1 to 11 in reading, math, language, science, and social science. 

2011 SREE Conference Abstract Template 

Population / Participants / Subjects: 

Description of the participants in the study: who, how many, key features or characteristics. 

Our interest in value-added measures on multiple tests restricts the data we are able to use. First, 
only grades and subjects where both the TAAS/TAKS and Stanford were given were considered 
(grades 3-8, reading and math). Second, the need for a lagged achievement measure on both 
tests eliminated grade 3 (the first grade tested on TAAS/TAKS) and 1998 (our first year of data). 
Third, an accurate match of students to classroom teachers limited us to grade 4 and 5 students in 
self-contained classes. Fourth, students who were missing test scores were excluded; in most 
cases the scores were missing due to purposeful exclusion. Taken together, these restrictions 
focus our analysis on 4th and 5th grade 
students tested in math and reading on both tests, approximately 27,000 per year between 1999 
and 2006. There are 2,100 to 2,600 unique teachers per grade represented in the analysis, 
depending on the grade and subject. 
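
The sample restrictions described above can be sketched as a simple filter. This is only an illustration, not the code used to build the analysis file: the record layout and every field name (`grade`, `year`, `self_contained`, `state_score`, `stanford_score`, `state_lag`, `stanford_lag`) are hypothetical, and the example records are made up.

```python
# Hypothetical student-year records; all field names are assumptions.
REQUIRED_SCORES = ("state_score", "stanford_score", "state_lag", "stanford_lag")

def in_analysis_sample(rec):
    """Return True if a student-year record passes the restrictions in the text."""
    # Grades 4 and 5 only: grade 3 has no lagged TAAS/TAKS score, and students
    # above grade 5 could not be matched reliably to one classroom teacher.
    if rec["grade"] not in (4, 5):
        return False
    # 1998 is the first year of data, so it can only supply lagged scores.
    if not 1999 <= rec["year"] <= 2006:
        return False
    # Only self-contained classes allow a clean student-teacher match.
    if not rec["self_contained"]:
        return False
    # Current and prior-year scores are needed on both test series.
    return all(rec.get(key) is not None for key in REQUIRED_SCORES)

base = {"grade": 4, "year": 2001, "self_contained": True,
        "state_score": 0.3, "stanford_score": 0.1,
        "state_lag": 0.2, "stanford_lag": 0.0}
records = [
    base,                                 # kept
    {**base, "grade": 3},                 # dropped: no lagged state score exists
    {**base, "year": 1998},               # dropped: first year of data
    {**base, "self_contained": False},    # dropped: no clean teacher match
    {**base, "stanford_score": None},     # dropped: missing test score
]
sample = [rec for rec in records if in_analysis_sample(rec)]
```

Writing the restrictions as one predicate makes it easy to count how many records each rule removes, which matters here because the text notes that most missing scores reflect purposeful exclusion rather than random loss.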

Intervention / Program / Practice: 

Description of the intervention, program or practice, including details of administration and duration. 

Like all Texas school districts, HISD has administered annual state assessments (the TAAS or 
TAKS) since the 1980s. The TAAS was a minimum competency test given from 1991 until 
2003, when it was replaced by the TAKS, a criterion-referenced test (Jennings and Beveridge, 
2009; Koedel and Betts, 2009). In 1996, HISD added the Stanford Achievement Test under 
pressure from a business task force that sought a nationally-normed benchmark (McAdams, 
2000). Since that time, all students have been required to take the Stanford. 

TAAS/TAKS is HISD's “high-stakes” test for several reasons. First, passing rates on these tests 
have been an integral part of Texas' accountability system for years (Reback, 2008). Schools and 
districts are rewarded or punished based on these test results. Second, HISD has operated a 
performance pay plan since 2000 that provides monetary rewards to schools and teachers for 
TAAS/TAKS results. Third, Texas uses the TAKS to determine grade promotion in grades 3 and 5. 

The Stanford can be considered HISD's “low-stakes” test, in that it is not tied to the state 
accountability system. However, the test plays several important roles in the district. For 
example, it is used as one criterion for grade promotion in grades 1-8. In addition, the Stanford is 
used to aid in the placement of students in specific programs, including gifted and special 
education. School-level results are publicly reported in the local media, and in recent years 
value-added measures on the Stanford were integrated into HISD's performance pay plan. 

Despite their disparate uses, reading and math skills covered on the TAKS and Stanford are 
broadly similar. 

Research Design: 

Description of research design (e.g., qualitative case study, quasi-experimental design, secondary analysis, analytic 
essay, randomized field trial). 

This is a secondary analysis of longitudinal student-level achievement data from the Houston 
Independent School District, in which students are matched to their classroom teachers. 
Following earlier work on teacher value-added modeling, we estimate teacher effects on 
achievement using a covariate adjustment random effects model with prior year test scores 
included as a key explanatory variable. Additional details are provided in the next section. 

Data Collection and Analysis: 

Description of the methods for collecting and analyzing data. 

For this paper we constructed a longitudinal dataset of all students tested in the Houston 
Independent School District (HISD) between 1998 and 2006. Our analysis focuses on 4th and 
5th grade students tested in math and/or reading on both the Texas state assessment (TAAS or 
TAKS) and the Stanford Achievement Test, approximately 27,000 students each year. All 
students are matched to demographic and program participation data, including age, gender, 
race/ethnicity, recent immigrant and migrant status, economic disadvantage, Limited English 
Proficiency, and special education status. 

Following Gordon, Kane, and Staiger (2006), Kane, Rockoff, and Staiger (2008), Papay 
(forthcoming), Jacob and Lefgren (2008) and others, we estimate individual teacher effects on 
achievement using the following student-level model, separately for each test (TAAS/TAKS and 
Stanford) and subject (math and reading): 

(1) $Y_{ijt} = \beta_g X_{ijt} + \alpha_g C_{jt} + \gamma_g S_{jt} + \lambda_g W_{jt} + \pi_{gt} + u_{ijt}$ 

where $Y_{ijt}$ represents the score for student i in classroom j in year t. $X_{ijt}$ is a vector of fixed and 
time-varying characteristics of student i, most importantly a cubic function of prior year 
achievement in both subjects on the same test series (TAAS/TAKS or Stanford). $C_{jt}$ and $S_{jt}$ are 
vectors of average classroom and school characteristics, and $W_{jt}$ represents teacher characteristics 
including experience and highest degree. $\pi_{gt}$ is a fixed grade-by-year effect. In some cases we 
include school fixed effects, and all coefficients are allowed to vary by grade. 

We assume the error term $u_{ijt}$ in (1), the extent to which student i's test score differs from that 
predicted given her past score, individual, classroom, school, and teacher characteristics, can be 
decomposed into variation due to persistent teacher effectiveness $\delta_j$ and an unexplained 
component $v_{it}$: 

(2) $u_{ijt} = \delta_j + v_{it}$ 

The parameters of interest are the $\delta_j$, or the persistent “teacher effects.” Although the $\delta_j$ can be 
estimated as fixed effects, we take the approach followed by Kane, Rockoff, and Staiger (2008) 
and others and treat these parameters as random effects, which we adjust for sampling variation 
using empirical Bayes shrinkage (Raudenbush and Bryk, 2002; Jacob and Lefgren, 2008). 
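
To make the shrinkage step concrete, the following is a minimal sketch in Python. It is not the estimator used in this study: it assumes student residuals from a model like (1) have already been computed, it substitutes a simple method-of-moments estimate of the variance components for a full random effects fit, and the names `eb_shrink` and `residuals_by_teacher` are hypothetical. The example residuals are made up.

```python
from statistics import mean

def eb_shrink(residuals_by_teacher):
    """Shrink each teacher's mean residual toward the grand mean.

    residuals_by_teacher maps a teacher id to a list of student residuals.
    Teachers with fewer students (noisier means) are shrunk more.
    """
    raw = {t: mean(r) for t, r in residuals_by_teacher.items()}
    n = {t: len(r) for t, r in residuals_by_teacher.items()}

    # Pooled within-teacher variance of the student residuals.
    ss_within = sum((x - raw[t]) ** 2
                    for t, r in residuals_by_teacher.items() for x in r)
    var_within = ss_within / (sum(n.values()) - len(n))

    # Between-teacher variance: variance of the raw means minus the
    # average sampling variance of those means, floored at zero.
    grand = mean(list(raw.values()))
    var_means = sum((m - grand) ** 2 for m in raw.values()) / (len(raw) - 1)
    var_teacher = max(0.0, var_means - mean([var_within / n[t] for t in n]))

    shrunk = {}
    for t in raw:
        # Reliability: share of the variance in this teacher's raw mean
        # that reflects true differences rather than sampling noise.
        if var_teacher > 0:
            reliability = var_teacher / (var_teacher + var_within / n[t])
        else:
            reliability = 0.0
        shrunk[t] = grand + reliability * (raw[t] - grand)
    return shrunk

# Teachers A and B have the same raw mean, but A has 2 students and B has 20.
example = {
    "A": [0.3, 0.7],
    "B": [0.3, 0.7] * 10,
    "C": [-0.3, -0.7] * 10,
    "D": [-0.3, -0.7],
}
estimates = eb_shrink(example)
```

In this toy example teachers A and B share a raw mean of 0.5, but A's estimate is pulled further toward zero because it rests on far fewer students.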

Our estimates of the $\delta_j$ make use of all available data for each teacher, which can include as 
many as 8 years of classroom data and 225 valid students. In examining properties of teacher 
effects such as inter-temporal stability and correlation with time-varying factors such as 
experience, we estimate a version of (1) using teacher-by-year (or classroom) effects for each 
subject and test. The resulting $\delta_j$ estimates are used to address the research questions outlined 
above. 

Findings / Results: 

Description of the main findings with specific details. 

Our results indicate that teacher effects on the high-stakes test vary substantially more than those 
in the same subject on the low-stakes test (insert Figure 1 here). For the TAAS/TAKS, we find a 
standard deviation in teacher quality of 0.23 in reading and 0.28 in math. In contrast, the standard 
deviation is nearly half this size on the Stanford: 0.13 and 0.15, respectively. Teacher effects on 
different tests of the same subject in the same year are only modestly correlated, at 0.61 in math 
and 0.52 in reading. Figure 2 expresses this correlation another way, showing the proportion of 
teachers in each quintile of effectiveness on one test that ranked in quintiles 1-5 on the second 
test (insert Figure 2 here). As an illustration, we find only 48 percent of teachers in the top 
quintile of the TAKS math test were also in the top quintile of the Stanford test. A non-trivial 
share (13%) ranked among the lowest two quintiles of the Stanford. We find very little 
difference across the two tests in inter-temporal stability. 
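
The quintile comparison described above can be sketched as follows. This is an illustrative Python example with made-up teacher effects, not the study's data or code; the function names `quintiles` and `cross_tab` are hypothetical, and it assumes both tests have estimates for the same set of teachers.

```python
def quintiles(effects):
    """Assign each teacher a quintile from 1 (lowest) to 5 (highest) by rank."""
    ranked = sorted(effects, key=effects.get)
    n = len(ranked)
    return {t: (rank * 5) // n + 1 for rank, t in enumerate(ranked)}

def cross_tab(effects_a, effects_b):
    """Count teachers in each (quintile on test A, quintile on test B) cell."""
    qa, qb = quintiles(effects_a), quintiles(effects_b)
    table = {a: {b: 0 for b in range(1, 6)} for a in range(1, 6)}
    for t in qa:
        table[qa[t]][qb[t]] += 1
    return table

# Made-up effects for ten teachers; teacher 9 looks strong on test A only.
test_a = {t: float(t) for t in range(10)}
test_b = dict(test_a)
test_b[9] = 0.5

table = cross_tab(test_a, test_b)
top_on_a = sum(table[5].values())
share_top_on_both = table[5][5] / top_on_a
```

Row 5 of `table` plays the role of the top-quintile row in Figure 2: dividing each cell by the row total gives the share of top-quintile teachers on one test who land in each quintile on the other.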

Perhaps more importantly, we find that teacher effects on the high-stakes TAKS test decay at a 
much faster rate than those on the low-stakes Stanford test. Using the method proposed by 
Jacob, Lefgren, and Sims (forthcoming), we estimated the persistence of teacher-induced gains 
on achievement in later grades and found that 34% of a teacher’s effect on grade 4 mathematics 
carried through to grade 5, as measured by the Stanford test, while only 16% of her effect on 
achievement persisted as measured on the high-stakes TAKS test. The corresponding numbers 
in reading were 31% and 20%. 
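
A back-of-the-envelope illustration of what these persistence rates imply, and not a result reported by the study: suppose a grade 4 teacher is one standard deviation above average, and assume the pooled 4th and 5th grade standard deviations reported above (0.23 and 0.28 on the TAAS/TAKS, 0.13 and 0.15 on the Stanford) apply to grade 4. Multiplying each by the corresponding persistence rate gives the portion of that teacher's effect still visible in grade 5:

```latex
\begin{align*}
\text{TAAS/TAKS math:}    \quad & 0.28 \times 0.16 \approx 0.045 \\
\text{Stanford math:}     \quad & 0.15 \times 0.34 = 0.051 \\
\text{TAAS/TAKS reading:} \quad & 0.23 \times 0.20 = 0.046 \\
\text{Stanford reading:}  \quad & 0.13 \times 0.31 \approx 0.040
\end{align*}
```

Under these assumptions the carried-over effects are of similar size on the two tests, even though the initial spread in teacher effects is much larger on the high-stakes test.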

Finally, we find important differences in the impact of teacher observables on student 
performance across the two tests. The returns to teacher experience are compressed on the high- 
stakes test, such that the majority of the returns occur in the first 2 to 3 years. In contrast, we 
find positive returns to experience on the low-stakes reading test throughout the first 15 years of 
teachers' careers. 

Conclusions: 

Description of conclusions, recommendations, and limitations based on findings. 

If our estimates of teacher effects could be considered causal effects on student achievement, the 
high-stakes state assessments and low-stakes Stanford Achievement Test would offer very 
different evidence about the overall variation in teacher quality and the relative contribution of 
teachers to test outcomes. Moreover, differences in these sets of estimates have implications for 
value-added based systems in practice. Rewards and sanctions linked to student performance on 
one test may yield quite different results when applied to a different test of very similar content. 
More research is needed on the extent to which “high-stakes” testing alters teacher behavior 
(relative to a low-stakes test), such that value-added based estimates of effectiveness are 
compromised. 

Figure 1: Distribution of teacher effects: 4th and 5th grade mathematics and reading 

[Figure not reproduced. Panels: Mathematics, Reading. Series: TAAS/TAKS, Stanford.] 






Figure 2: Quintiles of value-added on Stanford mathematics test, by quintile on TAKS math test 